Testing the Robustness of Online Word Segmentation: Effects of Linguistic Diversity and Phonetic Variation
Abstract
Models of the acquisition of word segmentation are typically evaluated using phonemically transcribed corpora. Accordingly, they implicitly assume that children know how to undo phonetic variation when they learn to extract words from speech. Moreover, whereas models of language acquisition should perform similarly across languages, evaluation is often limited to English samples. Using child-directed corpora of English, French and Japanese, we evaluate the performance of state-of-the-art statistical models given inputs where phonetic variation has not been reduced. To do so, we measure segmentation robustness across different levels of segmental variation, simulating systematic allophonic variation or errors in phoneme recognition. We show that these models are not robust to increases in such variation and do not generalize to typologically different languages. From the perspective of early language acquisition, the results strengthen the hypothesis that phonological knowledge is acquired in large part before the construction of a lexicon.
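The evaluation idea described in the abstract (degrade the segmental input, then measure how well word boundaries are recovered) can be illustrated with a minimal Python sketch. This is not the authors' procedure: the function names `corrupt_phonemes`, `boundary_positions` and `boundary_f_score` are hypothetical, the noise model covers only uniform random phoneme substitutions (not systematic allophonic variation), and the boundary F-score is assumed here as the metric because it is common in segmentation work, not because the abstract names it.

```python
import random


def corrupt_phonemes(utterance, inventory, error_rate, rng):
    """Simulate phoneme recognition errors (assumed noise model).

    utterance: list of words, each word a list of phoneme symbols.
    inventory: list of at least two distinct phoneme symbols.
    Each phoneme is replaced by a different symbol from the inventory
    with probability error_rate; word lengths are left unchanged.
    """
    corrupted = []
    for word in utterance:
        new_word = []
        for phoneme in word:
            if rng.random() < error_rate:
                alternatives = [p for p in inventory if p != phoneme]
                new_word.append(rng.choice(alternatives))
            else:
                new_word.append(phoneme)
        corrupted.append(new_word)
    return corrupted


def boundary_positions(utterance):
    """Return utterance-internal word boundaries as phoneme offsets."""
    positions = set()
    offset = 0
    for word in utterance[:-1]:
        offset += len(word)
        positions.add(offset)
    return positions


def boundary_f_score(gold, predicted):
    """Boundary F-score over paired lists of segmented utterances."""
    true_pos = n_pred = n_gold = 0
    for gold_utt, pred_utt in zip(gold, predicted):
        gold_b = boundary_positions(gold_utt)
        pred_b = boundary_positions(pred_utt)
        true_pos += len(gold_b & pred_b)
        n_pred += len(pred_b)
        n_gold += len(gold_b)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In a robustness test along these lines, one would corrupt the unsegmented input at several values of `error_rate`, run a segmentation model on each version, and compare the resulting boundary scores against the uncorrupted baseline.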
Similar resources
The Influence of Word Retrieval and Planning on Phonetic Variation: Implications for Exemplar Models.
Over the past several decades, an increasing number of empirical studies have documented the interaction of information across the traditional linguistic modules of phonetics, phonology, and lexicon. For example, the frequency with which a word occurs influences the phonetic properties of its sounds; high frequency words tend to be reduced relative to low frequency words. Lexicalist Exemplar Mo...
"Blind" Speech Segmentation: Automatic Segmentation of Speech without Linguistic Knowledge
A new automatic speech segmentation procedure, called "Blind" speech segmentation, is presented. This procedure allows a speech sample to be segmented into sub-word units without the knowledge of any linguistic information (such as orthographic or phonetic transcription). Hence, this procedure involves finding the optimal number of sub-word segments in the given speech sample, before locatin...
From segmentation bootstrapping to transcription-to-word conversion
The mapping of a raw phonetic transcription to an orthographic word sequence is carried out in three steps: First, a syllable segmentation of the transcription is bootstrapped, based on unsupervised subtractive learning. Then, the syllables are grouped to word entities guided by non-linguistic distributional properties. Finally, the phonetic word segmentations are mapped onto entries of a canon...
Speech Recognition using Acoustic Landmarks and Binary Phonetic Feature Classifiers
In spite of decades of research, Automatic Speech Recognition (ASR) is far from reaching the goal of performance close to Human Speech Recognition (HSR). One of the reasons for unsatisfactory performance of the state-of-the-art ASR systems, that are based largely on Hidden Markov Models (HMMs), is the inferior acoustic modeling of low level or phonetic level linguistic information in the speech...
How discourse context shapes the lexicon: Explaining the distribution of Spanish f- / h- words
Using a corpus of Medieval Spanish text, we examine factors affecting the Modern Standard Spanish outcome of the initial /f/ in Latin FV-words. Regression analyses reveal that the frequency of a word's use in extralexical phonetic reducing environments and lexical stress patterns significantly predict the modern distribution of f-([f]) and h-(Ø) in the Spanish lexicon of FV-words. Quantificatio...